
    Self-supervised learning of a facial attribute embedding from video

We propose a self-supervised framework for learning facial attributes by simply watching videos of a human face speaking, laughing, and moving over time. To perform this task, we introduce a network, Facial Attributes-Net (FAb-Net), that is trained to embed multiple frames from the same video face-track into a common low-dimensional space. With this approach, we make three contributions: first, we show that the network can leverage information from multiple source frames by predicting confidence/attention masks for each frame; second, we demonstrate that using a curriculum learning regime improves the learned embedding; finally, we demonstrate that the network learns a meaningful face embedding that encodes information about head pose, facial landmarks and facial expression, i.e. facial attributes, without having been supervised with any labelled data. Our method is comparable or superior to state-of-the-art self-supervised methods on these tasks and approaches the performance of supervised methods.
Comment: To appear in BMVC 2018. Supplementary material can be found at http://www.robots.ox.ac.uk/~vgg/research/unsup_learn_watch_faces/fabnet.htm
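The core mechanism here, embedding several source frames and weighting them by predicted confidence, can be pictured compactly. Below is a minimal sketch, not the FAb-Net implementation: the stand-in encoder, the module names, and all dimensions are illustrative assumptions.

```python
# Hypothetical sketch: fuse per-frame embeddings with predicted
# confidence/attention weights (illustrative, not the FAb-Net code).
import torch
import torch.nn as nn

class FrameFusion(nn.Module):
    """Embed each frame, predict a scalar confidence per frame, and
    combine the embeddings as a softmax-weighted average."""

    def __init__(self, embed_dim: int = 256):
        super().__init__()
        self.encoder = nn.LazyLinear(embed_dim)    # stand-in for a conv encoder
        self.confidence = nn.Linear(embed_dim, 1)  # per-frame attention logit

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, n_frames, channels, height, width)
        b, n = frames.shape[:2]
        z = self.encoder(frames.reshape(b * n, -1)).view(b, n, -1)
        w = torch.softmax(self.confidence(z), dim=1)  # weights over frames
        return (w * z).sum(dim=1)                     # fused embedding (b, d)

fused = FrameFusion()(torch.randn(2, 4, 3, 64, 64))  # -> shape (2, 256)
```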

    AXES at TRECVid 2011

The AXES project participated in the interactive known-item search (KIS) task and the interactive instance search (INS) task for TRECVid 2011. We used the same system architecture and a nearly identical user interface for both the KIS and INS tasks. Both systems made use of text search on ASR transcripts, visual concept detectors, and visual similarity search. The user experiments were carried out with media professionals and media students at the Netherlands Institute for Sound and Vision, with media professionals performing the KIS task and media students performing the INS task. This paper describes the results and findings of our experiments.
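As a toy illustration of combining ranking signals of this kind, a weighted late fusion over min-max-normalised scores might look as follows. This is a generic sketch under assumed data structures, not the AXES system's actual fusion.

```python
# Hypothetical late fusion of per-modality retrieval scores
# (ASR text search, concept detectors, visual similarity).
def fuse_rankings(score_lists, weights):
    """score_lists: list of {shot_id: score} dicts, one per modality."""
    fused = {}
    for scores, w in zip(score_lists, weights):
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0  # avoid division by zero for flat scores
        for shot, s in scores.items():
            # Min-max normalise within the modality, then weight and sum.
            fused[shot] = fused.get(shot, 0.0) + w * (s - lo) / span
    return sorted(fused, key=fused.get, reverse=True)

ranked = fuse_rankings(
    [{"s1": 0.9, "s2": 0.4}, {"s1": 2.0, "s3": 5.0}, {"s2": 0.7}],
    weights=[0.5, 0.3, 0.2],
)
```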

    The PASCAL Visual Object Classes (VOC) Challenge

The Pascal Visual Object Classes (VOC) challenge is a benchmark in visual object category recognition and detection, providing the vision and machine learning communities with a standard dataset of images and annotation, and standard evaluation procedures. Organised annually from 2005 to the present, the challenge and its associated dataset have become accepted as the benchmark for object detection. This paper describes the dataset and evaluation procedure. We review the state of the art in evaluated methods for both classification and detection, and analyse whether the methods are statistically different, what they are learning from the images (e.g. the object or its context), and which objects the methods find easy or confuse. The paper concludes with lessons learnt in the three-year history of the challenge, and proposes directions for future improvement and extension. © 2009 Springer Science+Business Media, LLC
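As a concrete example of the standardised evaluation, early editions of the challenge (e.g. VOC2007) scored detection with 11-point interpolated average precision: precision is sampled at recall levels 0.0, 0.1, ..., 1.0, taking at each level the best precision achieved at that recall or above. A minimal sketch:

```python
# 11-point interpolated average precision, as used in early VOC
# detection evaluations.
def voc_ap_11pt(recall, precision):
    """recall, precision: parallel lists tracing the precision-recall curve."""
    ap = 0.0
    for t in [i / 10 for i in range(11)]:
        # Interpolation: best precision at any recall >= the threshold t.
        candidates = [p for r, p in zip(recall, precision) if r >= t]
        ap += (max(candidates) if candidates else 0.0) / 11.0
    return ap

print(voc_ap_11pt([0.0, 0.5, 1.0], [1.0, 0.8, 0.6]))  # 8/11 ~= 0.727
```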

    Reading to listen at the cocktail party: multi-modal speech separation

The goal of this paper is speech separation and enhancement in multi-speaker and noisy environments using a combination of different modalities. Previous works have shown good performance when conditioning on temporal or static visual evidence such as synchronised lip movements or face identity. In this paper, we present a unified framework for multi-modal speech separation and enhancement based on synchronous or asynchronous cues. To that end, we make the following contributions: (i) we design a modern Transformer-based architecture tailored to fuse different modalities and solve the speech separation task in the raw waveform domain; (ii) we propose conditioning on the textual content of a sentence, alone or in combination with visual information; (iii) we demonstrate the robustness of our model to audio-visual synchronisation offsets; and (iv) we obtain state-of-the-art performance on the well-established LRS2 and LRS3 benchmark datasets.
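One way to picture contribution (i) is a Transformer attending jointly over a latent waveform representation and conditioning tokens derived from lip frames or text. The sketch below is an assumption-laden toy, not the paper's architecture; every module choice and dimension is illustrative.

```python
# Hypothetical sketch: condition raw-waveform separation on extra tokens.
import torch
import torch.nn as nn

class ConditionedSeparator(nn.Module):
    def __init__(self, dim=128, n_heads=4, n_layers=2, win=16):
        super().__init__()
        self.encode = nn.Conv1d(1, dim, kernel_size=win, stride=win // 2)
        layer = nn.TransformerEncoderLayer(dim, n_heads, batch_first=True)
        self.fuse = nn.TransformerEncoder(layer, n_layers)
        self.mask = nn.Linear(dim, dim)
        self.decode = nn.ConvTranspose1d(dim, 1, kernel_size=win, stride=win // 2)

    def forward(self, wav, cond):
        # wav: (batch, samples); cond: (batch, n_tokens, dim) from lips/text
        z = self.encode(wav.unsqueeze(1)).transpose(1, 2)  # (b, frames, dim)
        # Attend over conditioning tokens and audio frames together,
        # then keep only the audio positions.
        h = self.fuse(torch.cat([cond, z], dim=1))[:, cond.size(1):]
        m = torch.sigmoid(self.mask(h))                    # per-frame mask
        return self.decode((m * z).transpose(1, 2)).squeeze(1)

out = ConditionedSeparator()(torch.randn(2, 16000), torch.randn(2, 10, 128))
```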

Learning an Alphabet of Shape and Appearance for Multi-Class Object Detection (Int J Comput Vis, DOI 10.1007/s11263-008-0139-3)

We present a novel algorithmic approach to object categorization and detection that can learn category-specific detectors, using Boosting, from a visual alphabet of shape and appearance. The alphabet itself is learnt incrementally during this process. The resulting representation consists of a set of category-specific descriptors (basic shape features represented by boundary fragments, and appearance represented by patches), where each descriptor, in combination with centroid vectors for possible object centroids (geometry), forms an alphabet entry. Our experimental results highlight several qualities of this novel representation. First, we demonstrate the power of a purely shape-based representation, with excellent categorization and detection results using a Boundary-Fragment-Model (BFM), and investigate the capability of such a model to handle changes in scale and viewpoint, as well as intra- and inter-class variability. Second, we show that incremental learning of a BFM for many categories leads to sub-linear growth in visual alphabet entries through the sharing of shape features, while this generalization over categories at the same time often improves categorization performance (over learning the categories independently). Finally, the combination of basic shape and appearance (boundary-fragment and patch) features can …
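The geometry component, centroid vectors attached to each alphabet entry, amounts to Hough-style voting: each matched fragment casts votes for plausible object centroids, and detections correspond to maxima in the vote space. A minimal sketch with assumed data structures, not the paper's implementation:

```python
# Hypothetical centroid voting for fragment-based detection.
from collections import defaultdict

def vote_for_centroids(matches, grid=10):
    """matches: list of (x, y, offsets, weight) tuples, where (x, y) is the
    image location a fragment matched at and offsets is the list of (dx, dy)
    centroid vectors learnt for that alphabet entry."""
    votes = defaultdict(float)
    for x, y, offsets, weight in matches:
        for dx, dy in offsets:
            # Accumulate weighted evidence on a coarse spatial grid.
            cell = (round((x + dx) / grid), round((y + dy) / grid))
            votes[cell] += weight
    # The strongest cell is the most-supported object centroid hypothesis.
    return max(votes, key=votes.get) if votes else None

peak = vote_for_centroids([(40, 60, [(10, -5)], 0.9), (55, 50, [(-5, 5)], 0.8)])
```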